Improving State-of-the-Art Continuous Speech Recognition Systems Using the N-Best Paradigm with Neural Networks
ABSTRACT
In an effort to advance the state of the art in continuous speech recognition employing hidden Markov models (HMM), Segmental Neural Nets (SNN) were introduced recently to ameliorate the well-known limitations of HMMs, namely, the conditional-independence limitation and the relative difficulty with which HMMs can handle segmental features. We describe a hybrid SNN/HMM system that combines the speed and performance of our HMM system with the segmental modeling capabilities of SNNs. The integration of the two acoustic modeling techniques is achieved successfully via the N-best rescoring paradigm. The N-best lists are used not only for recognition, but also during training. This discriminative training using the N-best lists is demonstrated to improve performance. When tested on the DARPA Resource Management speaker-independent corpus, the hybrid SNN/HMM system decreases the error by about 20% compared to the state-of-the-art HMM system.

INTRODUCTION

In February 1991, we introduced at the DARPA Speech and Natural Language Workshop the concept of a Segmental Neural Net (SNN) for phonetic modeling in continuous speech recognition [1]. The SNN was introduced to overcome some of the well-known limitations of hidden Markov models (HMM), which now represent the state of the art in continuous speech recognition (CSR). Two such limitations are (i) the conditional-independence assumption, which prevents an HMM from taking full advantage of the correlation that exists among the frames of a phonetic segment, and (ii) the awkwardness with which segmental features (such as duration) can be incorporated into HMM systems.

We developed the concept of the SNN specifically to overcome the two HMM limitations just mentioned for phonetic modeling in speech. However, neural nets are known to require a large amount of computation, especially for training. Also, there is no known efficient search technique for finding the best-scoring segmentation with neural nets in continuous speech. Therefore, we have developed a hybrid SNN/HMM system that is designed to take full advantage of the good properties of both methods: the phonetic modeling properties of SNNs and the good computational properties of HMMs. The two methods are integrated through the use of the N-best paradigm, which was developed in conjunction with the BYBLOS system at BBN [7,6].

A year ago, we presented very preliminary results using our hybrid system on the speaker-dependent portion of the DARPA Resource Management corpus [1]. At that time, the training of the neural net was performed only on the correct transcription of the utterances. In this paper, we describe the performance of the hybrid system on the speaker-independent portion of the Resource Management corpus, using discriminative training on the whole N-best list. Below, we give a description of the SNN, the integration of the SNN with the HMM models using the N-best paradigm, the training of the hybrid SNN/HMM system using the whole N-best list, and the results on a development set.

SEGMENTAL NEURAL NET STRUCTURE

The SNN differs from other approaches to the use of neural networks in speech recognition in that it attempts to recognize each phoneme by using all the frames in a phonetic segment simultaneously to perform the recognition. The SNN is a neural network that takes the frames of a phonetic segment as input and produces as output an estimate of the probability of a phoneme given the input segment.
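To make this segment-level mapping concrete, the following is a minimal sketch of such a network in Python/NumPy. Only the input dimensionality (5 frames of 16 features, i.e., 80 inputs) and the fact that the output estimates phoneme probabilities come from the text; the single hidden layer, its size, the tanh/softmax nonlinearities, and the phoneme inventory size are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Sketch of a segmental neural net scoring one phonetic segment.
# From the text: input = 5 frames x 16 features = 80 values;
# output = estimate of P(phoneme | segment).
# Assumptions for illustration: one hidden layer of 32 tanh units,
# a softmax output, and a 53-phoneme inventory.

N_FRAMES, N_FEATURES = 5, 16   # fixed-length segment representation
N_PHONEMES = 53                # assumed phoneme inventory size

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(N_FRAMES * N_FEATURES, 32))
b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, N_PHONEMES))
b2 = np.zeros(N_PHONEMES)

def snn_score(segment):
    """segment: (5, 16) array of warped frame features -> phoneme posteriors."""
    x = segment.reshape(-1)            # concatenate the 5 frames into 80 inputs
    h = np.tanh(x @ W1 + b1)           # hidden layer (assumed)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())  # softmax over the phoneme inventory
    return e / e.sum()

posteriors = snn_score(rng.normal(size=(N_FRAMES, N_FEATURES)))
print(posteriors.argmax(), posteriors.max())
```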
But the SNN requires the availability of some form of phonetic segmentation of the speech. To consider all possible segmentations of the input speech would be computationally prohibitive. We describe in Section 3 how we use the HMM to obtain likely candidate segmentations. Here, we shall assume that a phonetic segmentation has been made available.

The structure of a typical SNN is shown in Figure 1. The input to the net is a fixed number of frames of speech features (5 frames in our system). The features in each 10-ms frame consist of 16 scalar values: 14 mel-warped cepstral coefficients, power, and power difference. Thus, the input to the SNN consists of a total of 80 features. But the actual number of frames in a phonetic segment is variable. Therefore, we convert the variable number of frames in each segment to a fixed number of frames (in this case, five frames). In this way, the SNN is able to deal effectively with variable-length segments in continuous speech. The requisite time warping is performed by a quasi-linear sampling of the feature vectors comprising the segment. For example, in a 17-frame phonetic segment, we would use frames 1, 5, 9, 13, and 17 as input to the SNN. In a 3-frame segment, the five frames used are 1, 1, 2, 3, 3, with a repetition of the first and third frames.
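A minimal sketch of this quasi-linear sampling, assuming it amounts to rounding evenly spaced positions to the nearest frame (the paper does not spell out its exact interpolation or tie-breaking rule), reproduces both frame selections given in the text:

```python
import numpy as np

def warp_segment(num_frames, target=5):
    """Quasi-linearly sample `target` frame indices (1-based) from a segment."""
    # Evenly spaced positions across the segment, rounded to actual frames.
    # Note: np.round uses round-half-to-even, which happens to match the
    # paper's two worked examples.
    idx = np.round(np.linspace(0, num_frames - 1, target)).astype(int)
    return idx + 1  # 1-based, to match the frame numbering in the text

print(warp_segment(17))  # [ 1  5  9 13 17]
print(warp_segment(3))   # [1 1 2 3 3]
```

The selected frames' feature vectors would then be stacked into the fixed 5 x 16 input described above.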